Automatic Transcription Verification of Broadcast News and Similar Speech Corpora

Authors

  • Michael Pitz
  • Sirko Molau
  • Ralf Schlüter
  • Hermann Ney
Abstract

In the last few years, the focus of ASR research has shifted from the recognition of clean read speech (e.g. WSJ) to the more challenging task of transcribing found speech such as broadcast news (Hub-4 task) and telephone conversations (Switchboard). Available training corpora tend to be larger and to contain more transcription errors than before, because found speech is more difficult to transcribe. In this paper we present a method to automatically detect faulty training scripts. Based on the Hub-4 task, we report on the efficiency of error detection with the proposed method and investigate the effect of both manually and automatically cleaned training corpora on the word error rate (WER) of the RWTH large vocabulary continuous speech recognition (LVCSR) system. This work is a joint effort of the University of Technology (RWTH) and Philips Research Laboratories Aachen, Germany.
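For readers unfamiliar with the evaluation metric named in the abstract, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. It illustrates the metric only, not the authors' verification method or the RWTH scoring setup; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and not taken from the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.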

Similar resources

Automatic verification of broadcast news transcriptions

In this paper we present a method for automatically detecting erroneous training scripts for speech corpora like Broadcast News and Switchboard. Based on the Hub-4 task we will report on the performance of error detection with the proposed method and investigate the effects of both manually and automatically cleaned training corpora on the performance of the RWTH speech recognition system. Our ...

An Analysis of Sentence Segmentation Features for Broadcast News, Broadcast Conversations, and Meetings

Information retrieval techniques for speech are based on those developed for text, and thus expect structured data as input. An essential task is to add sentence boundary information to the otherwise unannotated stream of words output by automatic speech recognition systems. We analyze sentence segmentation performance as a function of feature types and transcription (manual versus automatic) f...

A Lightweight on-the-fly Capitalization System for Automatic Speech Recognition

This paper describes a lightweight method for capitalizing speech transcriptions. Several resources were used, including a lexicon, newspaper written corpora and speech transcriptions. Different approaches were tested both generative and discriminative: finite state transducers, automatically built from Language Models; and maximum entropy models. Evaluation results are presented both for writt...

Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations

Structural metadata extraction (MDE) research aims to develop techniques for automatic conversion of raw speech recognition output to forms that are more useful to humans and to downstream automatic processes. It may be achieved by inserting boundaries of syntactic/semantic units to the flow of speech, labeling non-content words like filled pauses and discourse markers for optional removal, and...

Toward Automatic Recognition of Japanese Broadcast News

In this paper we report on automatic recognition of Japanese broadcast-news speech. We have been working on large-vocabulary continuous speech recognition (LVCSR) for Japanese newspaper speech transcription and achieved reasonably good performance. We have recently applied our LVCSR system to transcribing Japanese broadcast-news speech. We extended the vocabulary to 20k words and trained the lan...

Publication date: 1999